Overview

Dataset statistics

Number of variables: 9
Number of observations: 768
Missing cells: 0
Missing cells (%): 0.0%
Duplicate rows: 0
Duplicate rows (%): 0.0%
Total size in memory: 54.1 KiB
Average record size in memory: 72.2 B

Variable types

Numeric: 8
Categorical: 1

Warnings

Pregnancies has 111 (14.5%) zeros

Reproduction

Analysis started: 2021-03-06 19:06:28.181962
Analysis finished: 2021-03-06 19:06:55.273559
Duration: 27.09 seconds
Software version: pandas-profiling v2.11.0
Download configuration: config.yaml

Variables

Pregnancies
Real number (ℝ≥0)

ZEROS

Distinct: 17
Distinct (%): 2.2%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 3.845052083
Minimum: 0
Maximum: 17
Zeros: 111
Zeros (%): 14.5%
Memory size: 6.1 KiB

Quantile statistics

Minimum: 0
5-th percentile: 0
Q1: 1
Median: 3
Q3: 6
95-th percentile: 10
Maximum: 17
Range: 17
Interquartile range (IQR): 5

Descriptive statistics

Standard deviation: 3.369578063
Coefficient of variation (CV): 0.8763413316
Kurtosis: 0.1592197775
Mean: 3.845052083
Median Absolute Deviation (MAD): 2
Skewness: 0.9016739792
Sum: 2953
Variance: 11.35405632
Monotonicity: Not monotonic
Histogram with fixed size bins (bins=17)
Value               Count  Frequency (%)
1                   135    17.6%
0                   111    14.5%
2                   103    13.4%
3                   75     9.8%
4                   68     8.9%
5                   57     7.4%
6                   50     6.5%
7                   45     5.9%
8                   38     4.9%
9                   28     3.6%
Other values (7)    58     7.6%

Value               Count  Frequency (%)
0                   111    14.5%
1                   135    17.6%
2                   103    13.4%
3                   75     9.8%
4                   68     8.9%
5                   57     7.4%
6                   50     6.5%
7                   45     5.9%
8                   38     4.9%
9                   28     3.6%

Value               Count  Frequency (%)
17                  1      0.1%
15                  1      0.1%
14                  2      0.3%
13                  10     1.3%
12                  9      1.2%
11                  11     1.4%
10                  24     3.1%
9                   28     3.6%
8                   38     4.9%
7                   45     5.9%
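
The summary figures for Pregnancies can be rechecked from the frequency tables alone, because together they list all 17 distinct values. The sketch below is only a sanity check, not part of the report: it rebuilds the column from the value counts and recomputes a few statistics with Python's standard library. The names `counts` and `values` are illustrative, and the comparison with the reported variance assumes the sample (n - 1) definition, which is what the reported number agrees with.

```python
import statistics

# Value -> count, transcribed from the Pregnancies frequency tables in this report.
counts = {
    0: 111, 1: 135, 2: 103, 3: 75, 4: 68, 5: 57, 6: 50, 7: 45, 8: 38,
    9: 28, 10: 24, 11: 11, 12: 9, 13: 10, 14: 2, 15: 1, 17: 1,
}

# Expand the counts back into one observation per row.
values = [value for value, count in counts.items() for _ in range(count)]

n = len(values)
total = sum(values)
mean = statistics.mean(values)
variance = statistics.variance(values)  # sample variance (divides by n - 1)
std = statistics.stdev(values)          # square root of the sample variance
cv = std / mean                         # coefficient of variation
median = statistics.median(values)

print(f"n={n} sum={total} mean={mean:.9f}")
print(f"variance={variance:.8f} std={std:.9f} cv={cv:.10f} median={median}")
```

The recomputed values can be compared against the Mean, Sum, Variance, Standard deviation, CV and Median entries listed for this variable.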

Glucose
Real number (ℝ≥0)

Distinct: 137
Distinct (%): 17.8%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 121.6973577
Minimum: 44
Maximum: 199
Zeros: 0
Zeros (%): 0.0%
Memory size: 6.1 KiB

Quantile statistics

Minimum: 44
5-th percentile: 80
Q1: 99.75
Median: 117
Q3: 141
95-th percentile: 181
Maximum: 199
Range: 155
Interquartile range (IQR): 41.25

Descriptive statistics

Standard deviation: 30.46200769
Coefficient of variation (CV): 0.2503095241
Kurtosis: -0.2686874714
Mean: 121.6973577
Median Absolute Deviation (MAD): 20
Skewness: 0.5309321048
Sum: 93463.57069
Variance: 927.9339123
Monotonicity: Not monotonic
Histogram with fixed size bins (bins=50)
Value               Count  Frequency (%)
99                  17     2.2%
100                 17     2.2%
129                 14     1.8%
125                 14     1.8%
111                 14     1.8%
106                 14     1.8%
112                 13     1.7%
105                 13     1.7%
108                 13     1.7%
102                 13     1.7%
Other values (127)  626    81.5%

Value               Count  Frequency (%)
44                  1      0.1%
56                  1      0.1%
57                  2      0.3%
61                  1      0.1%
62                  1      0.1%
65                  1      0.1%
67                  1      0.1%
68                  3      0.4%
71                  4      0.5%
72                  1      0.1%

Value               Count  Frequency (%)
199                 1      0.1%
198                 1      0.1%
197                 4      0.5%
196                 3      0.4%
195                 2      0.3%
194                 3      0.4%
193                 2      0.3%
191                 1      0.1%
190                 1      0.1%
189                 4      0.5%

BloodPressure
Real number (ℝ≥0)

Distinct: 48
Distinct (%): 6.2%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 72.42814101
Minimum: 24
Maximum: 122
Zeros: 0
Zeros (%): 0.0%
Memory size: 6.1 KiB

Quantile statistics

Minimum: 24
5-th percentile: 52
Q1: 64
Median: 72
Q3: 80
95-th percentile: 90
Maximum: 122
Range: 98
Interquartile range (IQR): 16

Descriptive statistics

Standard deviation: 12.10604379
Coefficient of variation (CV): 0.1671455821
Kurtosis: 1.083684227
Mean: 72.42814101
Median Absolute Deviation (MAD): 8
Skewness: 0.1315143054
Sum: 55624.8123
Variance: 146.5562962
Monotonicity: Not monotonic
Histogram with fixed size bins (bins=48)
Value               Count  Frequency (%)
70                  57     7.4%
74                  52     6.8%
78                  45     5.9%
68                  45     5.9%
72                  44     5.7%
64                  43     5.6%
80                  40     5.2%
76                  39     5.1%
60                  37     4.8%
62                  34     4.4%
Other values (38)   332    43.2%

Value               Count  Frequency (%)
24                  1      0.1%
30                  2      0.3%
38                  1      0.1%
40                  1      0.1%
44                  4      0.5%
46                  2      0.3%
48                  5      0.7%
50                  13     1.7%
52                  11     1.4%
54                  11     1.4%

Value               Count  Frequency (%)
122                 1      0.1%
114                 1      0.1%
110                 3      0.4%
108                 2      0.3%
106                 3      0.4%
104                 2      0.3%
102                 1      0.1%
100                 3      0.4%
98                  3      0.4%
96                  4      0.5%

SkinThickness
Real number (ℝ≥0)

Distinct: 51
Distinct (%): 6.6%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 29.24704236
Minimum: 7
Maximum: 99
Zeros: 0
Zeros (%): 0.0%
Memory size: 6.1 KiB

Quantile statistics

Minimum: 7
5-th percentile: 14.35
Q1: 25
Median: 28
Q3: 33
95-th percentile: 44
Maximum: 99
Range: 92
Interquartile range (IQR): 8

Descriptive statistics

Standard deviation: 8.923908459
Coefficient of variation (CV): 0.3051217401
Kurtosis: 4.895310917
Mean: 29.24704236
Median Absolute Deviation (MAD): 5
Skewness: 0.7618184215
Sum: 22461.72853
Variance: 79.63614218
Monotonicity: Not monotonic
Histogram with fixed size bins (bins=50)
Value               Count  Frequency (%)
27.23545706         139    18.1%
33                  108    14.1%
32                  31     4.0%
30                  27     3.5%
27                  23     3.0%
23                  22     2.9%
28                  20     2.6%
18                  20     2.6%
31                  19     2.5%
39                  18     2.3%
Other values (41)   341    44.4%

Value               Count  Frequency (%)
7                   2      0.3%
8                   2      0.3%
10                  5      0.7%
11                  6      0.8%
12                  7      0.9%
13                  11     1.4%
14                  6      0.8%
15                  14     1.8%
16                  6      0.8%
17                  14     1.8%

Value               Count  Frequency (%)
99                  1      0.1%
63                  1      0.1%
60                  1      0.1%
56                  1      0.1%
54                  2      0.3%
52                  2      0.3%
51                  1      0.1%
50                  3      0.4%
49                  3      0.4%
48                  4      0.5%

Insulin
Real number (ℝ≥0)

Distinct: 187
Distinct (%): 24.3%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 157.0035269
Minimum: 14
Maximum: 846
Zeros: 0
Zeros (%): 0.0%
Memory size: 6.1 KiB

Quantile statistics

Minimum: 14
5-th percentile: 50
Q1: 121.5
Median: 130.2878788
Q3: 206.8461538
95-th percentile: 293
Maximum: 846
Range: 832
Interquartile range (IQR): 85.34615385

Descriptive statistics

Standard deviation: 88.86091421
Coefficient of variation (CV): 0.5659803699
Kurtosis: 12.08584802
Mean: 157.0035269
Median Absolute Deviation (MAD): 41
Skewness: 2.62272843
Sum: 120578.7086
Variance: 7896.262074
Monotonicity: Not monotonic
Histogram with fixed size bins (bins=50)
Value               Count  Frequency (%)
130.2878788         236    30.7%
206.8461538         138    18.0%
105                 11     1.4%
130                 9      1.2%
140                 9      1.2%
120                 8      1.0%
180                 7      0.9%
94                  7      0.9%
100                 7      0.9%
135                 6      0.8%
Other values (177)  330    43.0%

Value               Count  Frequency (%)
14                  1      0.1%
15                  1      0.1%
16                  1      0.1%
18                  2      0.3%
22                  1      0.1%
23                  2      0.3%
25                  1      0.1%
29                  1      0.1%
32                  1      0.1%
36                  3      0.4%

Value               Count  Frequency (%)
846                 1      0.1%
744                 1      0.1%
680                 1      0.1%
600                 1      0.1%
579                 1      0.1%
545                 1      0.1%
543                 1      0.1%
540                 1      0.1%
510                 1      0.1%
495                 2      0.3%

BMI
Real number (ℝ≥0)

Distinct: 249
Distinct (%): 32.4%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 32.44642005
Minimum: 18.2
Maximum: 67.1
Zeros: 0
Zeros (%): 0.0%
Memory size: 6.1 KiB

Quantile statistics

Minimum: 18.2
5-th percentile: 22.235
Q1: 27.5
Median: 32.05
Q3: 36.6
95-th percentile: 44.395
Maximum: 67.1
Range: 48.9
Interquartile range (IQR): 9.1

Descriptive statistics

Standard deviation: 6.878969502
Coefficient of variation (CV): 0.2120101229
Kurtosis: 0.9147639622
Mean: 32.44642005
Median Absolute Deviation (MAD): 4.55
Skewness: 0.6021444767
Sum: 24918.8506
Variance: 47.32022142
Monotonicity: Not monotonic
Histogram with fixed size bins (bins=50)
Value               Count  Frequency (%)
32                  13     1.7%
31.2                12     1.6%
31.6                12     1.6%
33.3                10     1.3%
32.4                10     1.3%
30.1                9      1.2%
30.8                9      1.2%
32.8                9      1.2%
32.9                9      1.2%
30.85967413         9      1.2%
Other values (239)  666    86.7%

Value               Count  Frequency (%)
18.2                3      0.4%
18.4                1      0.1%
19.1                1      0.1%
19.3                1      0.1%
19.4                1      0.1%
19.5                2      0.3%
19.6                3      0.4%
19.9                1      0.1%
20                  1      0.1%
20.1                1      0.1%

Value               Count  Frequency (%)
67.1                1      0.1%
59.4                1      0.1%
57.3                1      0.1%
55                  1      0.1%
53.2                1      0.1%
52.9                1      0.1%
52.3                2      0.3%
50                  1      0.1%
49.7                1      0.1%
49.6                1      0.1%

DiabetesPedigreeFunction
Real number (ℝ≥0)

Distinct: 517
Distinct (%): 67.3%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 0.4718763021
Minimum: 0.078
Maximum: 2.42
Zeros: 0
Zeros (%): 0.0%
Memory size: 6.1 KiB

Quantile statistics

Minimum: 0.078
5-th percentile: 0.14035
Q1: 0.24375
Median: 0.3725
Q3: 0.62625
95-th percentile: 1.13285
Maximum: 2.42
Range: 2.342
Interquartile range (IQR): 0.3825

Descriptive statistics

Standard deviation: 0.331328595
Coefficient of variation (CV): 0.7021513764
Kurtosis: 5.594953528
Mean: 0.4718763021
Median Absolute Deviation (MAD): 0.1675
Skewness: 1.919911066
Sum: 362.401
Variance: 0.1097786379
Monotonicity: Not monotonic
Histogram with fixed size bins (bins=50)
Value               Count  Frequency (%)
0.258               6      0.8%
0.254               6      0.8%
0.238               5      0.7%
0.259               5      0.7%
0.268               5      0.7%
0.261               5      0.7%
0.207               5      0.7%
0.245               4      0.5%
0.167               4      0.5%
0.299               4      0.5%
Other values (507)  719    93.6%

Value               Count  Frequency (%)
0.078               1      0.1%
0.084               1      0.1%
0.085               2      0.3%
0.088               2      0.3%
0.089               1      0.1%
0.092               1      0.1%
0.096               1      0.1%
0.1                 1      0.1%
0.101               1      0.1%
0.102               1      0.1%

Value               Count  Frequency (%)
2.42                1      0.1%
2.329               1      0.1%
2.288               1      0.1%
2.137               1      0.1%
1.893               1      0.1%
1.781               1      0.1%
1.731               1      0.1%
1.699               1      0.1%
1.698               1      0.1%
1.6                 1      0.1%

Age
Real number (ℝ≥0)

Distinct: 52
Distinct (%): 6.8%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 33.24088542
Minimum: 21
Maximum: 81
Zeros: 0
Zeros (%): 0.0%
Memory size: 6.1 KiB

Quantile statistics

Minimum: 21
5-th percentile: 21
Q1: 24
Median: 29
Q3: 41
95-th percentile: 58
Maximum: 81
Range: 60
Interquartile range (IQR): 17

Descriptive statistics

Standard deviation: 11.76023154
Coefficient of variation (CV): 0.3537881556
Kurtosis: 0.6431588885
Mean: 33.24088542
Median Absolute Deviation (MAD): 7
Skewness: 1.129596701
Sum: 25529
Variance: 138.3030459
Monotonicity: Not monotonic
Histogram with fixed size bins (bins=50)
Value               Count  Frequency (%)
22                  72     9.4%
21                  63     8.2%
25                  48     6.2%
24                  46     6.0%
23                  38     4.9%
28                  35     4.6%
26                  33     4.3%
27                  32     4.2%
29                  29     3.8%
31                  24     3.1%
Other values (42)   348    45.3%

Value               Count  Frequency (%)
21                  63     8.2%
22                  72     9.4%
23                  38     4.9%
24                  46     6.0%
25                  48     6.2%
26                  33     4.3%
27                  32     4.2%
28                  35     4.6%
29                  29     3.8%
30                  21     2.7%

Value               Count  Frequency (%)
81                  1      0.1%
72                  1      0.1%
70                  1      0.1%
69                  2      0.3%
68                  1      0.1%
67                  3      0.4%
66                  4      0.5%
65                  3      0.4%
64                  1      0.1%
63                  4      0.5%

Outcome
Categorical

Distinct: 2
Distinct (%): 0.3%
Missing: 0
Missing (%): 0.0%
Memory size: 6.1 KiB
0: 500
1: 268

Length

Max length: 1
Median length: 1
Mean length: 1
Min length: 1

Characters and Unicode

Total characters: 768
Distinct characters: 2
Distinct categories: 1
Distinct scripts: 1
Distinct blocks: 1
The Unicode Standard assigns character properties to each code point, which can be used to analyse textual variables.

Unique

Unique: 0
Unique (%): 0.0%

Sample

1st row: 1
2nd row: 1
3rd row: 1
4th row: 1
5th row: 1

Value               Count  Frequency (%)
0                   500    65.1%
1                   268    34.9%

Histogram of lengths of the category

Value               Count  Frequency (%)
0                   500    65.1%
1                   268    34.9%

Most occurring characters

Value               Count  Frequency (%)
0                   500    65.1%
1                   268    34.9%

Most occurring categories

Value               Count  Frequency (%)
Decimal Number      768    100.0%

Most frequent character per category

Value               Count  Frequency (%)
0                   500    65.1%
1                   268    34.9%

Most occurring scripts

Value               Count  Frequency (%)
Common              768    100.0%

Most frequent character per script

Value               Count  Frequency (%)
0                   500    65.1%
1                   268    34.9%

Most occurring blocks

Value               Count  Frequency (%)
ASCII               768    100.0%

Most frequent character per block

Value               Count  Frequency (%)
0                   500    65.1%
1                   268    34.9%

Interactions

Correlations

Pearson's r

Pearson's correlation coefficient (r) is a measure of linear correlation between two variables. Its value lies between -1 and +1, with -1 indicating total negative linear correlation, 0 indicating no linear correlation, and +1 indicating total positive linear correlation. Furthermore, r is invariant under separate changes in location and scale of the two variables, implying that for a linear function the angle to the x-axis does not affect r.

To calculate r for two variables X and Y, one divides the covariance of X and Y by the product of their standard deviations.
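
As a minimal sketch of that calculation (not the report's own implementation), the function below divides the sample covariance by the product of the sample standard deviations. The function name `pearson_r` is illustrative. The two short lists are the Glucose and Age values from the ten "First rows" of this report's sample, so the result says nothing about the full dataset.

```python
import math

def pearson_r(x, y):
    """Covariance of x and y divided by the product of their standard deviations."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sample covariance and standard deviations; the (n - 1) factors cancel in the ratio.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x) / (n - 1))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y) / (n - 1))
    return cov / (sd_x * sd_y)

# Ten illustrative observations (Glucose and Age from the "First rows" sample).
glucose = [148.0, 183.0, 137.0, 78.0, 197.0, 125.0, 168.0, 189.0, 166.0, 100.0]
age = [50, 32, 33, 26, 53, 54, 34, 59, 51, 32]

r = pearson_r(glucose, age)

# Shifting and rescaling one variable (with a positive factor) leaves r unchanged.
r_rescaled = pearson_r([2.0 * g + 10.0 for g in glucose], age)

print(r, r_rescaled)
```

A perfectly linear relationship, such as comparing `glucose` with `3 * glucose - 7`, gives r = 1 up to floating-point rounding, and flipping the sign of the slope gives r = -1.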

Spearman's ρ

Spearman's rank correlation coefficient (ρ) is a measure of monotonic correlation between two variables, and is therefore better at catching nonlinear monotonic correlations than Pearson's r. Its value lies between -1 and +1, with -1 indicating total negative monotonic correlation, 0 indicating no monotonic correlation, and +1 indicating total positive monotonic correlation.

To calculate ρ for two variables X and Y, one divides the covariance of the rank variables of X and Y by the product of the standard deviations of those rank variables.
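
The sketch below follows that description: it ranks each variable and then applies the covariance-over-standard-deviations formula to the ranks. It assumes tied values share the average of their positions, which is a common convention; `average_ranks`, `pearson_r` and `spearman_rho` are illustrative names rather than functions from the report's software.

```python
import math

def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def pearson_r(x, y):
    """Covariance divided by the product of standard deviations."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def spearman_rho(x, y):
    """Pearson correlation of the rank variables of x and y."""
    return pearson_r(average_ranks(x), average_ranks(y))

# A monotonic but nonlinear relationship: y = x ** 3.
x = [1, 2, 3, 4, 5]
y = [v ** 3 for v in x]
print(spearman_rho(x, y), pearson_r(x, y))
```

For this cubic example ρ reaches 1 because the ranks agree exactly, while Pearson's r stays below 1 because the relationship is not linear.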

Kendall's τ

Similarly to Spearman's rank correlation coefficient, the Kendall rank correlation coefficient (τ) measures ordinal association between two variables. Its value lies between -1 and +1, with -1 indicating total negative correlation, 0 indicating no correlation, and +1 indicating total positive correlation.

To calculate τ for two variables X and Y, one determines the number of concordant and discordant pairs of observations. τ is given by the number of concordant pairs minus the number of discordant pairs, divided by the total number of pairs.
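
A direct, quadratic-time sketch of the pair-counting formula as stated above is shown below. It is an illustration under the assumption that tied pairs count as neither concordant nor discordant; library implementations often use a tie-adjusted variant such as tau-b, so their results can differ when ties are present. The name `kendall_tau` is illustrative.

```python
from itertools import combinations

def kendall_tau(x, y):
    """(concordant - discordant) / total number of pairs, as described in the text."""
    concordant = 0
    discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        direction = (x1 - x2) * (y1 - y2)
        if direction > 0:
            concordant += 1  # both variables move in the same direction
        elif direction < 0:
            discordant += 1  # the variables move in opposite directions
        # direction == 0 means a tie in x or y; it is counted in neither group.
    n = len(x)
    total_pairs = n * (n - 1) // 2
    return (concordant - discordant) / total_pairs

print(kendall_tau([1, 2, 3, 4], [1, 2, 3, 4]))
print(kendall_tau([1, 2, 3, 4], [4, 3, 2, 1]))
print(kendall_tau([1, 2, 3], [1, 3, 2]))
```

In the last call two of the three pairs are concordant and one is discordant, so τ = (2 - 1) / 3.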

Phik (φk)

Phik (φk) is a new and practical correlation coefficient that works consistently between categorical, ordinal and interval variables, captures non-linear dependency, and reverts to the Pearson correlation coefficient in the case of a bivariate normal input distribution. Extensive documentation is available from the Phik project.

Missing values

A simple visualization of nullity by column.
The nullity matrix is a data-dense display that lets you quickly pick out visual patterns in data completion.

Sample

First rows

     Pregnancies  Glucose  BloodPressure  SkinThickness     Insulin        BMI  DiabetesPedigreeFunction  Age  Outcome
0              6    148.0      72.000000           35.0  206.846154  33.600000                     0.627   50        1
1              8    183.0      64.000000           33.0  206.846154  23.300000                     0.672   32        1
2              0    137.0      40.000000           35.0  168.000000  43.100000                     2.288   33        1
3              3     78.0      50.000000           32.0   88.000000  31.000000                     0.248   26        1
4              2    197.0      70.000000           45.0  543.000000  30.500000                     0.158   53        1
5              8    125.0      96.000000           33.0  206.846154  35.406767                     0.232   54        1
6             10    168.0      74.000000           33.0  206.846154  38.000000                     0.537   34        1
7              1    189.0      60.000000           23.0  846.000000  30.100000                     0.398   59        1
8              5    166.0      72.000000           19.0  175.000000  25.800000                     0.587   51        1
9              7    100.0      75.321429           33.0  206.846154  30.000000                     0.484   32        1

Last rows

     Pregnancies  Glucose  BloodPressure  SkinThickness     Insulin   BMI  DiabetesPedigreeFunction  Age  Outcome
758            1    121.0           78.0      39.000000   74.000000  39.0                     0.261   28        0
759            3    108.0           62.0      24.000000  130.287879  26.0                     0.223   25        0
760            7    137.0           90.0      41.000000  130.287879  32.0                     0.391   39        0
761            1    106.0           76.0      27.235457  130.287879  37.5                     0.197   26        0
762            2     88.0           58.0      26.000000   16.000000  28.4                     0.766   22        0
763            9     89.0           62.0      27.235457  130.287879  22.5                     0.142   33        0
764           10    101.0           76.0      48.000000  180.000000  32.9                     0.171   63        0
765            2    122.0           70.0      27.000000  130.287879  36.8                     0.340   27        0
766            5    121.0           72.0      23.000000  112.000000  26.2                     0.245   30        0
767            1     93.0           70.0      31.000000  130.287879  30.4                     0.315   23        0